Abstract


In a distributed system using message logging and checkpointing to provide fault tolerance,
there is always a unique maximum recoverable system state, regardless of the message logging
protocol used. The proof of this relies on the observation that the set of system states that
have occurred during any single execution of a system forms a lattice, with the sets of
consistent and recoverable system states as sublattices. The maximum recoverable system state
never decreases, and if all messages are eventually logged, the domino effect cannot occur.
This paper presents a general model for reasoning about recovery in such a system and, based
on this model, an efficient algorithm for determining the maximum recoverable system state at
any time. This work unifies existing approaches to fault tolerance based on message logging
and checkpointing, and improves on existing methods for optimistic recovery in distributed
systems.

1 Introduction


Message logging and checkpointing can be used to provide an effective fault-tolerance mechanism
in a distributed system in which all process communication is through messages. Each message
received by a process is logged on stable storage [Lampson79], and each process is occasionally
checkpointed to stable storage, but no coordination is required between the checkpoints of
different processes. Between received messages, the execution of each process is assumed to be
deterministic.

The  protocols  used  for  message  logging  are  typically  pessimistic.  With  these  protocols,  each 
message  is  synchronously  logged  as  it  is  received,  either  by  blocking  the  receiver  until  the  message 
is  logged  [Borg83,  Powell83],  or  by  blocking  the  receiver  if  it  attempts  to  send  a  new  message  before 
this  received  message  is  logged  [Johnson87].  Recovery  based  on  pessimistic  message  logging  is 
straightforward.  A  failed  process  is  restarted  from  its  last  checkpoint,  and  all  messages  originally 
received  by  this  process  since  the  checkpoint  are  replayed  to  it  from  the  log  in  the  same  order  as 
they  were  received  before  the  failure.  The  process  reexecutes  based  on  these  messages  to  its  state 
at  the  time  of  the  failure.  Messages  sent  by  the  process  during  recovery  are  ignored  since  they  are 
duplicates  of  those  sent  before  the  failure. 

On the other hand, optimistic protocols perform the message logging asynchronously [Strom85].
The receiver continues to execute normally, and received messages are logged later, for
example by grouping several messages and writing them to stable storage in a single operation.
The receiver of a message depends on the state of the sender, though. If the sender fails and
cannot be recovered (for example, because some message has not been logged), the receiver
becomes an orphan process, and its state must be rolled back during recovery to a point before
this dependency was created. If rolling back this process causes other processes to become
orphans, they too must be rolled back during recovery. The domino effect [Randell75, Russell80]
is an uncontrolled propagation of such rollbacks and must be avoided to guarantee progress in
spite of failures. Recovery based on optimistic message logging must find the "most recent"
combination of process states that can be recreated, such that none of the process states is an
orphan.

Optimistic message logging protocols appear to be desirable in systems in which failures are rare
and failure-free performance is of primary concern. Since optimistic protocols avoid synchronization
delays during message logging, performance in the absence of failures is improved. Although the
required recovery procedure is then more complicated, this procedure is only required when a failure
occurs.

Section 2 of this paper presents a general model for reasoning about these recovery methods in
distributed systems. With this model, we show that there is always a unique maximum recoverable
system state, which never decreases, and that if all messages received are eventually logged, the
domino effect cannot occur. Based on this model, Section 3 describes and proves the correctness
of our algorithm for determining the maximum recoverable system state at any time. The
algorithm requires no additional messages in the system, and supports recovery from any number of
concurrent failures, including a total failure. Our model and algorithm make no assumptions about
the message logging protocol used; they support both pessimistic and optimistic logging protocols,
although pessimistic protocols do not require their full generality. Section 4 then relates this work
to existing fault-tolerance methods published in the literature and discusses the effect of
different message logging protocols on our model and algorithm. Finally, Section 5 summarizes the
contributions of this work and draws some conclusions.

2 The Model

2.1  Process  States 

Each  time  a  process  receives  an  input  message,  it  begins  a  new  state  interval,  a  deterministic 
sequence  of  execution  based  only  on  the  state  of  the  process  at  the  time  that  the  message  is  received 
and  on  the  contents  of  the  message  itself.  Within  each  process,  each  state  interval  is  identified  by 
a  unique  sequential  state  interval  index,  which  is  simply  a  count  of  the  number  of  input  messages 
that  the  process  has  received. 

All dependencies of a process i on some process j can be encoded simply as the maximum index
of any state interval of process j on which process i depends. This encoding is possible since the
execution of a process within each state interval is deterministic and since any state interval in a
process naturally also depends on all previous intervals of the same process.

All dependencies of any process i can, therefore, be represented by a dependency vector

$$d_i = \langle \delta_* \rangle = \langle \delta_1, \delta_2, \delta_3, \ldots, \delta_n \rangle ,$$

where n is the total number of processes in the system. Component j of process i's dependency
vector, $\delta_j$, gives the maximum index of any state interval of process j on which process i currently
depends. Component i of process i's own dependency vector is always set to the index of process
i's current state interval. If process i has no dependency on any state interval of some process j,
then $\delta_j$ is set to $\bot$, which is less than all possible state interval indices.

Processes  cooperate  to  maintain  their  current  dependency  vectors  by  tagging  all  messages  sent 
with  their  current  state  interval  index  and  by  remembering  in  each  process  the  maximum  index 
in  any  message  received  from  each  other  process.  During  any  single  execution  of  the  system,  the 
dependency  vector  for  any  process  is  uniquely  determined  by  the  state  interval  index  of  the  process. 
No  component  of  the  dependency  vector  of  any  process  can  decrease  through  normal  execution  of 
the  process. 
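The bookkeeping described above can be illustrated with a minimal sketch. The class name `Process`, its methods, and the use of `-1` to stand in for the "no dependency" value are assumptions made for illustration only; they are not part of the model itself.

```python
from typing import List

BOTTOM = -1  # assumed stand-in for "no dependency"; below every real index


class Process:
    """Hypothetical helper that tracks one process's dependency vector."""

    def __init__(self, pid: int, n: int) -> None:
        self.pid = pid
        self.dep: List[int] = [BOTTOM] * n
        self.dep[pid] = 0  # state interval 0: no input message received yet

    def send_tag(self) -> int:
        """Tag carried by every outgoing message: the current interval index."""
        return self.dep[self.pid]

    def receive(self, sender: int, tag: int) -> None:
        """Receiving a message starts a new state interval in the receiver."""
        self.dep[self.pid] += 1
        # Remember the maximum index seen in any message from this sender.
        self.dep[sender] = max(self.dep[sender], tag)


p0, p1 = Process(0, 3), Process(1, 3)
p1.receive(0, p0.send_tag())
print(p1.dep)
```

In this sketch no component of `dep` is ever lowered by `receive`, which matches the observation that dependency vector components cannot decrease through normal execution.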


2.2  System  States 

The state of the system is the composition of the states of all component processes of the system
and may be represented by an $n \times n$ dependency matrix. Taking the dependency vector, $d_i$, of each
process i in the system, the dependency matrix

$$D = [\delta_{**}] = \begin{bmatrix}
\delta_{11} & \delta_{12} & \cdots & \delta_{1n} \\
\delta_{21} & \delta_{22} & \cdots & \delta_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\delta_{n1} & \delta_{n2} & \cdots & \delta_{nn}
\end{bmatrix}$$

can be formed, where row i, $\delta_{ij}$, $1 \le j \le n$, is the dependency vector for process i. Since component
i of process i's dependency vector is always the index of process i's current state interval, the
diagonal of the dependency matrix, $\delta_{ii}$, $1 \le i \le n$, shows the current state interval index of each
process in the system.

Let $\mathcal{S}$ be the set of all system states that have occurred during any single execution of some
system. The set $\mathcal{S}$ forms a partial order called the system history relation, in which one system
state precedes another if and only if it must have occurred first during the execution. This relation
can be expressed in terms of the state interval index of each process as shown in the dependency
matrix.

Definition 1 If $A = [\alpha_{**}]$ and $B = [\beta_{**}]$ are system states in $\mathcal{S}$, then

$$A \preceq B \iff \forall i \, [\alpha_{ii} \le \beta_{ii}] .$$

This partial order differs from that defined by Lamport's happened before relation [Lamport78] in
that it orders the system states that result from events rather than the events themselves, and that
only state intervals (started by the receipt of a message) constitute events.

For example, Figure 1 shows a system of four communicating processes. The horizontal lines
represent the execution of each process, each arrow represents a message from one process to
another, and the numbers give the index of the state interval started by the receipt of each message.
Consider the two possible system states A and B, where in state A, message a has been received
but message b has not, and in state B, message b has been received but message a has not. These
states can be expressed by the dependency matrices

[Dependency matrices for system states A and B, with diagonal entries circled, omitted.]

States  A  and  B  are  incomparable  under  the  system  history  relation,  which  can  be  seen  by  comparing 
the  circled  values  on  the  diagonals  of  these  two  dependency  matrices. 

2.3  The  System  History  Lattice 

A system state describes the set of messages that have been received by each process. Two system
states in $\mathcal{S}$ can be combined to form their union such that each process has received all of the
messages that it has in either of the two original system states. This can be expressed in terms of
the dependency matrices describing these system states by choosing for each process the row that
has the largest state interval index of the corresponding rows in the original matrices.

Figure 1 The system history partial order

Definition 2 If $A = [\alpha_{**}]$ and $B = [\beta_{**}]$ are system states in $\mathcal{S}$, then the union of
A and B is $A \cup B = [\gamma_{**}]$, such that

$$\forall i \left[ \gamma_{i*} = \begin{cases} \alpha_{i*} & \text{if } \alpha_{ii} \ge \beta_{ii} \\ \beta_{i*} & \text{otherwise} \end{cases} \right] .$$


Likewise,  the  intersection  of  two  system  states  in  S  can  be  formed  such  that  each  process  has 
received  only  those  messages  that  it  has  in  both  of  the  two  original  system  states.  This  can  be 
formed  from  the  dependency  matrices  describing  these  states  by  choosing  for  each  process  the  row 
that  has  the  smallest  state  interval  index  of  the  corresponding  rows  in  the  original  matrices. 

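Definition 3 If $A = [\alpha_{**}]$ and $B = [\beta_{**}]$ are system states in $\mathcal{S}$, then the intersection
of A and B is $A \cap B = [\delta_{**}]$, such that

$$\forall i \left[ \delta_{i*} = \begin{cases} \alpha_{i*} & \text{if } \alpha_{ii} \le \beta_{ii} \\ \beta_{i*} & \text{otherwise} \end{cases} \right] .$$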


Continuing the example of Section 2.2 in Figure 1, the union and intersection of states A and
B can be formed by choosing the proper rows from these two matrices to get

[Dependency matrices for $A \cup B$ and $A \cap B$ omitted.]

The following theorem introduces the system history lattice formed by the set of system states
that have occurred during any single execution of some system, ordered by the system history
relation.


Theorem 1 The set $\mathcal{S}$, ordered by the system history relation, forms a lattice. For
any $A, B \in \mathcal{S}$, the least upper bound of A and B is $A \cup B$, and the greatest lower bound
of A and B is $A \cap B$.

Proof Straightforward from the construction of system state union and intersection in Definitions
2 and 3. □
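The system history relation and the union and intersection operations act row by row on dependency matrices, which a short sketch can make concrete. The function names, the list-of-rows matrix representation, the value `-1` for "no dependency", and the two-process example states are all assumptions for illustration; the example matrices are invented and are not those of Figure 1.

```python
from typing import List

Matrix = List[List[int]]
BOTTOM = -1  # assumed stand-in for "no dependency"


def precedes(a: Matrix, b: Matrix) -> bool:
    """System history relation: compare only the diagonal entries."""
    return all(a[i][i] <= b[i][i] for i in range(len(a)))


def union(a: Matrix, b: Matrix) -> Matrix:
    """For each process, keep the row with the larger state interval index."""
    return [list(a[i]) if a[i][i] >= b[i][i] else list(b[i])
            for i in range(len(a))]


def intersection(a: Matrix, b: Matrix) -> Matrix:
    """For each process, keep the row with the smaller state interval index."""
    return [list(a[i]) if a[i][i] <= b[i][i] else list(b[i])
            for i in range(len(a))]


# Invented two-process states from one execution: in `a` only process 0 has
# received a message; in `b` only process 1 has.
a = [[1, 0], [BOTTOM, 0]]
b = [[0, BOTTOM], [0, 1]]
u = union(a, b)
m = intersection(a, b)
print(precedes(a, b), precedes(b, a), u, m)
```

Here `a` and `b` are incomparable, while `m` precedes both of them and both precede `u`, as expected of a greatest lower bound and a least upper bound.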


2.4  Consistent  System  States 

A system state is consistent if and only if all messages received by all processes have either already
been sent in the state of the sending process or can deterministically be sent by that process in the
future. Since process execution within a state interval is deterministic, any message sent before the
end of the current state interval can deterministically be sent, but messages sent after this cannot
be. Only a consistent system state could be reached through normal execution of the system from
its initial state, as would be seen if an instantaneous snapshot of the entire system could be
observed [Chandy85].

Any messages shown by the system state to be sent but not yet received do not cause the
system state to be inconsistent. These messages can be handled by the normal mechanism for
reliable message delivery, if any, used by the underlying system. In particular, suppose such a
message m was received by some process i after the state of process i was observed to form the
system state, and suppose process i then sent some message n (such as an acknowledgement of
message m), which could show this receipt. If message n has been received in this system state,
the state will be inconsistent because message n (not message m) is shown as having been received
but not yet sent. If message n has not been received yet, no effect of either message can be seen in
the system state, and it is thus still consistent.

If a system state is consistent, then no process depends on a state interval of the sender greater
than the sender's current state interval in the dependency matrix. For each column j of the
dependency matrix, no element in that column may be larger than the element on the diagonal of the
matrix.

Definition 4 If $D = [\delta_{**}]$ is some system state in $\mathcal{S}$, D is consistent if and only if

$$\forall i, j \, [\delta_{ij} \le \delta_{jj}] .$$


Let the set $\mathcal{C} \subseteq \mathcal{S}$ be the set of consistent system states that have occurred during any single
execution of some system. Thus,

$$\mathcal{C} = \{ D \in \mathcal{S} \mid D \text{ is consistent} \} .$$
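The column-wise test for consistency translates directly into code. The following is a minimal sketch; the function name, the list-of-rows representation, the value `-1` for "no dependency", and the two example matrices are assumptions for illustration and do not come from the figures.

```python
from typing import List

Matrix = List[List[int]]
BOTTOM = -1  # assumed stand-in for "no dependency"


def is_consistent(d: Matrix) -> bool:
    """No entry in column j may exceed the diagonal entry of column j."""
    n = len(d)
    return all(d[i][j] <= d[j][j] for i in range(n) for j in range(n))


# Invented example: process 0 depends on interval 1 of process 1, but
# process 1 is only in interval 0, so the state is inconsistent.
bad = [[1, 1], [BOTTOM, 0]]
# The same state once process 1 has reached interval 1.
good = [[1, 1], [BOTTOM, 1]]
print(is_consistent(bad), is_consistent(good))
```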


For example, consider the system of three processes whose execution is shown in Figure 2. The
state of each process here is observed where the curve crosses the execution line for that process,
and the resulting system state is represented by the dependency matrix

[Dependency matrix $D = [\delta_{**}]$ for the system state of Figure 2, with diagonal entries circled, omitted.]

This system state is not consistent since process 1 has received a message (to begin state interval 1)
from process 2 that has not been sent yet by process 2 and cannot be deterministically sent in the
future. This inconsistency is shown in the dependency matrix since $\delta_{12}$ is greater than $\delta_{22}$.


Figure 2 An inconsistent system state

Lemma 1 The set $\mathcal{C}$ forms a sublattice of the system history lattice.

Proof It suffices to show that for any $A, B \in \mathcal{C}$, $A \cup B \in \mathcal{C}$ and $A \cap B \in \mathcal{C}$. Let $A = [\alpha_{**}]$ and
$B = [\beta_{**}]$.

($A \cup B \in \mathcal{C}$): Let $C = [\gamma_{**}] = A \cup B$. In each column j of C, either $\gamma_{ij} \le \alpha_{jj}$ or $\gamma_{ij} \le \beta_{jj}$
for all i, since $A \in \mathcal{C}$ and $B \in \mathcal{C}$. Since $\gamma_{jj} = \max(\alpha_{jj}, \beta_{jj})$, $\gamma_{ij} \le \gamma_{jj}$ for all i as well. Therefore,
$A \cup B \in \mathcal{C}$.

($A \cap B \in \mathcal{C}$): Let $D = [\delta_{**}] = A \cap B$. By Definition 3 and since no element in the dependency
vector for any process ever decreases as the process executes, $\delta_{ij} = \min(\alpha_{ij}, \beta_{ij})$ for all i and
j. This implies that $\delta_{ij} \le \alpha_{ij}$ and $\delta_{ij} \le \beta_{ij}$. Since A and B are consistent, $\alpha_{ij} \le \alpha_{jj}$ and
$\beta_{ij} \le \beta_{jj}$. Combining this with the previous result yields $\delta_{ij} \le \alpha_{jj}$ and $\delta_{ij} \le \beta_{jj}$. This implies
that $\delta_{ij} \le \min(\alpha_{jj}, \beta_{jj})$, and thus $\delta_{ij} \le \delta_{jj}$ for all i and j. Therefore, $A \cap B \in \mathcal{C}$. □


2.5  Message  Logging  and  Checkpointing 

A message is called logged if and only if its data and the index of the state interval that it started
in its receiver process are both recorded on stable storage. The predicate logged(i, $\sigma$) is true if and
only if the message that started state interval $\sigma$ in process i is logged.

The predicate checkpoint(i, $\sigma$) is true if and only if there exists a checkpoint on stable storage
that records the state of process i in state interval $\sigma$. When a process is created, it is immediately
checkpointed before it begins execution, and thus, checkpoint(i, 0) is true for all processes i.

For every state interval $\sigma$ of some process, there must be some checkpoint on stable storage for
that process with a state interval index no larger than $\sigma$, since there is always at least a checkpoint
on stable storage for state interval 0.

Definition 5 The effective checkpoint for a state interval $\sigma$ of some process i is the
checkpoint on stable storage for process i with the largest state interval index $\epsilon$ such
that $\epsilon \le \sigma$.

A state interval of a process is called stable if and only if all messages received by the process
to start state intervals after its effective checkpoint are logged. The predicate stable(i, $\sigma$) is true if
and only if state interval $\sigma$ of process i is stable.

Definition 6 If $\sigma$ is the state interval index for some process i, and if $\epsilon$ is the state
interval index of the effective checkpoint for state interval $\sigma$ of process i, then state
interval $\sigma$ of process i is stable if and only if

$$\forall \alpha, \ \epsilon < \alpha \le \sigma \ [\mathrm{logged}(i, \alpha)] .$$

Any stable state interval $\sigma$ for a process can be recreated by restoring the process from the
effective checkpoint (with state interval index $\epsilon$) and replaying to it in order any logged messages
to begin state intervals $\epsilon + 1$ through $\sigma$.
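Definitions 5 and 6 can be checked mechanically for a single process. The sketch below assumes that the contents of stable storage for that process are summarized by two sets of state interval indices, `checkpoints` and `logged`; these names and the example values are hypothetical.

```python
from typing import Set


def effective_checkpoint(checkpoints: Set[int], sigma: int) -> int:
    """Largest checkpointed interval index that does not exceed sigma.

    Assumes interval 0 is always checkpointed, so the maximum exists.
    """
    return max(e for e in checkpoints if e <= sigma)


def is_stable(checkpoints: Set[int], logged: Set[int], sigma: int) -> bool:
    """True if every message starting intervals eps+1 .. sigma is logged."""
    eps = effective_checkpoint(checkpoints, sigma)
    return all(alpha in logged for alpha in range(eps + 1, sigma + 1))


# Invented stable-storage contents for one process.
checkpoints = {0, 4}
logged = {1, 2, 5}
print([s for s in range(7) if is_stable(checkpoints, logged, s)])
```

With these values, interval 3 is not stable because its message is not logged, while interval 4 is stable because it is itself checkpointed and needs no replay.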

The  checkpoint  of  a  process  includes  the  complete  current  dependency  vector  for  the  process. 
Each  logged  message,  though,  only  gives  the  single  dependency  created  in  the  receiver  by  this 
message.  The  complete  dependency  vector  for  any  stable  state  interval  of  some  process  is  always 
known,  though,  since  all  messages  that  started  state  intervals  since  the  effective  checkpoint  must 
be  logged. 

2.6  Recoverable  System  States 

A system state is called recoverable if and only if all component process states are stable and the
resulting system state is consistent. To recover the state of the system, it must be possible to
recover the states of the component processes, and for this system state to be meaningful, it must
be possible to have reached this state through normal execution of the system from its initial state.

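Definition 7 If $D = [\delta_{**}]$ is some system state in $\mathcal{S}$, D is recoverable if and only if

$$D \in \mathcal{C} \ \wedge \ \forall i \, [\mathrm{stable}(i, \delta_{ii})] .$$

Let the set $\mathcal{R} \subseteq \mathcal{S}$ be the set of recoverable system states that have occurred during any single
execution of some system. Thus,

$$\mathcal{R} = \{ D \in \mathcal{S} \mid D \text{ is recoverable} \} .$$

Since only consistent system states can be recoverable, $\mathcal{R} \subseteq \mathcal{C} \subseteq \mathcal{S}$.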

Lemma 2 The set $\mathcal{R}$ forms a sublattice of the system history lattice.

Proof For any $A, B \in \mathcal{R}$, $A \cup B \in \mathcal{C}$ and $A \cap B \in \mathcal{C}$, by Lemma 1. Since the state of each process
in A and B is stable, all process states in $A \cup B$ and $A \cap B$ are stable as well. Thus, $A \cup B \in \mathcal{R}$
and $A \cap B \in \mathcal{R}$, and $\mathcal{R}$ forms a sublattice. □


2.7  The  Current  Recovery  State 

In recovering after a failure, we wish to restore the state of the system to the "most recent"
recoverable state that is possible from the information available, in order to minimize the amount
of reexecution necessary to complete the recovery. The system history lattice corresponds to this
notion of time, and the following theorem establishes the existence of a single maximum recoverable
state under this ordering.

Theorem 2 There is always a unique maximum recoverable system state in $\mathcal{S}$.

Proof $\mathcal{R} \subseteq \mathcal{S}$, and by Lemma 2, $A \cup B \in \mathcal{R}$ for any $A, B \in \mathcal{R}$. Since $A \preceq A \cup B$ and $B \preceq A \cup B$,
the unique maximum in $\mathcal{S}$ is simply

$$\bigcup_{D \in \mathcal{R}} D ,$$

which must be unique since $\mathcal{R}$ forms a sublattice of the system history lattice. □

The maximum recoverable system state at any time is called the current recovery state of the
system. The following lemma shows that the current recovery state of the system never decreases.

Lemma 3 If the current recovery state of the system is $R = [\rho_{**}]$, then, for each
process i, the system can always be recovered without needing to roll back any state
interval $\sigma \le \rho_{ii}$.

Proof R will always remain consistent, and for each process i, state interval $\rho_{ii}$ will always remain
stable. Since $\mathcal{R}$ forms a sublattice, any new current recovery state established after R must be
greater than R in the lattice. By Definition 1, this implies that the state interval index for each
process in any new current recovery state must be greater than or equal to $\rho_{ii}$. Therefore, for each
process i, no state interval $\sigma \le \rho_{ii}$ will ever need to be rolled back. □

Corollary 1 If all messages received by executing processes are eventually logged,
there is no possibility of the domino effect in the system.

Proof  If  all  messages  are  eventually  logged,  all  state  intervals  of  all  processes  eventually  become 
stable  by  Definition  6,  and  thus  new  recoverable  states  must  become  possible  through  Definition  7. 
By  Lemma  3,  these  states  will  never  need  to  be  rolled  back.  □ 


2.8  Committing  Output 

If  some  state  interval  of  a  process  must  be  rolled  back  to  recover  a  consistent  system  state,  any 
output  messages  sent  while  that  state  interval  is  being  reexecuted  after  recovery  may  not  be  the 
same  as  those  originally  sent.  Any  processes  that  received  such  messages  will  be  orphans  and  must 
also  be  rolled  back  to  a  point  before  these  messages  were  received. 

However, messages sent to the outside world, such as those to the user's display terminal, cannot
be treated in the same way. Since the outside world generally cannot be rolled back, any messages
sent to the outside world must be delayed until it is known that the state interval from which they
were sent will never need to be rolled back, at which time they may be committed by releasing
them. The following corollary establishes when it is safe to commit an output message sent to the
outside world.

Corollary 2 If the current recovery state of the system is $R = [\rho_{**}]$, then any message
sent by a process i from a state interval $\sigma \le \rho_{ii}$ may be committed.

Proof  Follows  directly  from  Lemma  3.  □ 


2.9  Garbage  Collection 

While the system is operating, checkpoints and logged messages accumulate on stable storage in
case they are needed for some future recovery. This data may be removed from stable storage
whenever doing so will not interfere with the ability of the system to recover as needed. The
following two corollaries establish when this can safely be done.

Corollary 3 Let $R = [\rho_{**}]$ be the current recovery state. For each process i, if $\epsilon_i$ is
the state interval index of the effective checkpoint for its state interval $\rho_{ii}$, then any
checkpoint for process i with state interval index $\sigma < \epsilon_i$ may be released from stable
storage.

Proof  Follows  directly  from  Lemma  3.  □ 

Corollary 4 Let $R = [\rho_{**}]$ be the current recovery state. For each process i, if $\epsilon_i$ is
the state interval index of the effective checkpoint for its state interval $\rho_{ii}$, then any
message that begins a state interval in process i with index $\sigma \le \epsilon_i$ may be released from
stable storage.

Proof  Follows  directly  from  Lemma  3.  □ 
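The output commit and garbage collection rules for a single process can be sketched as below. The function names and the set-based view of stable storage are hypothetical. The sketch assumes the bound for logged messages is inclusive, on the reasoning that the effective checkpoint already captures the effect of the message that started its own state interval, while the bound for checkpoints is strict because the effective checkpoint itself must be kept.

```python
from typing import Set


def may_commit(rho_ii: int, sigma: int) -> bool:
    """Output sent from interval sigma may be released once sigma <= rho_ii."""
    return sigma <= rho_ii


def effective_checkpoint(checkpoints: Set[int], sigma: int) -> int:
    """Largest checkpointed interval index that does not exceed sigma."""
    return max(e for e in checkpoints if e <= sigma)


def releasable_checkpoints(checkpoints: Set[int], rho_ii: int) -> Set[int]:
    """Checkpoints strictly older than the effective checkpoint for rho_ii."""
    eps = effective_checkpoint(checkpoints, rho_ii)
    return {c for c in checkpoints if c < eps}


def releasable_messages(checkpoints: Set[int], logged: Set[int],
                        rho_ii: int) -> Set[int]:
    """Logged messages that started intervals up to and including eps."""
    eps = effective_checkpoint(checkpoints, rho_ii)
    return {m for m in logged if m <= eps}


# Invented stable-storage contents for one process whose entry in the
# current recovery state is rho_ii = 5.
checkpoints = {0, 3, 6}
logged = {1, 2, 3, 4, 5}
print(releasable_checkpoints(checkpoints, 5),
      releasable_messages(checkpoints, logged, 5))
```

In this example the checkpoint for interval 3 and the messages for intervals 4 and 5 are retained, since they are exactly what is needed to recreate state interval 5.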
3 Recovery State Algorithm

3.1  Introduction 


As  the  system  executes,  new  logged  messages  and  checkpoints  arrive  on  stable  storage.  Occasionally, 
some  combination  of  this  information  may  define  a  new  current  recovery  state  by  creating  a  new 
recoverable  system  state  greater  than  the  existing  current  recovery  state.  Theorem  2  of  Section  2 
established  that  there  is  always  a  unique  maximum  recoverable  state  at  any  time.  Conceptually,  this 
state  may  be  found  by  an  exhaustive  search  of  all  combinations  of  stable  process  state  intervals  until 
the  maximum  combination  is  found.  However,  such  a  search  would  be  too  expensive  in  practice, 
and  an  effective  means  of  limiting  this  search  space  is  important. 

The  recovery  state  algorithm  monitors  checkpoints  and  logged  messages  as  they  arrive  on  stable 
storage  and  decides  if  each  allows  an  advance  in  the  current  recovery  state  of  the  system.  The 
algorithm  is  invoked  each  time  a  process  state  interval  becomes  stable;  it  is  incremental  in  that  it 
only  examines  information  that  has  changed  since  its  last  execution,  rather  than  recomputing  the 
entire  current  recovery  state  on  each  execution.  Since  it  only  uses  information  on  stable  storage,  it 
can  handle  any  number  of  concurrent  process  failures. 

3.2  The  Basic  Algorithm 

Each time some new state interval $\sigma$ of some process k becomes stable, the algorithm attempts
to form a new current recovery state in which the state of process k is advanced to state interval
$\sigma$. It does so by including any state intervals from other processes that are necessary to make
this new system state consistent. The check for consistency is performed by a direct application of
the definition of system state consistency from Section 2. The algorithm succeeds if all such state
intervals included are stable, making this new consistent system state composed entirely of stable
process state intervals. Otherwise, no new current recovery state is possible.

An outline of the basic recovery state algorithm is shown below. Some details are omitted from
this outline for clarity; these will be discussed later after the basic algorithm is described. Let
$R = [\rho_{**}]$ be the current recovery state of the system. When state interval $\sigma$ of process k becomes
stable, the following steps are taken by the algorithm:

1. If $\sigma \le \rho_{kk}$, then exit the algorithm, since the current recovery state is already in advance of
state interval $\sigma$ of process k.

2. Make a new dependency matrix $D = [\delta_{**}]$ from R, with row k replaced by the dependency
vector for state interval $\sigma$ of process k.

3. Loop on step 3 while D is not consistent. That is, loop while there exists some i and j for
which $\delta_{ij} > \delta_{jj}$, which shows that some process i depends on a state interval of process j
greater than process j's current state interval in D.

Find a stable state interval $\alpha \ge \delta_{ij}$ of process j. If state interval $\delta_{ij}$ is stable, let $\alpha$ be $\delta_{ij}$;
otherwise, choose some later state interval of process j for $\alpha$, if one exists:

(a) If no such state interval exists for $\alpha$ that is stable, exit the algorithm, but remember to
recheck this later.

(b) Otherwise, replace row j of D with the dependency vector for state interval $\alpha$ of process j.

4. The dependency matrix D is now consistent and composed only of stable process state
intervals. It is thus recoverable. Replace R with this new system state D, making it the new
current recovery state.
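The four steps of the outline can be sketched in a few lines. This is an illustration under stated assumptions rather than a complete implementation: the function `advance_recovery_state`, the callbacks `dep_vector` and `min_stable_at_least`, the value `-1` for "no dependency", and the two-process example data are all hypothetical, and the sketch covers only the outline itself. In particular, it simply reports failure in step 3(a) and leaves the bookkeeping for later rechecks to the caller.

```python
from typing import Callable, Dict, List, Optional, Set

Matrix = List[List[int]]
BOTTOM = -1  # assumed stand-in for "no dependency"


def advance_recovery_state(
    r: Matrix, k: int, sigma: int,
    dep_vector: Callable[[int, int], List[int]],
    min_stable_at_least: Callable[[int, int], Optional[int]],
) -> Optional[Matrix]:
    """Return the new recovery state, or None if no advance is possible."""
    if sigma <= r[k][k]:                      # step 1
        return None
    d = [list(row) for row in r]              # step 2
    d[k] = list(dep_vector(k, sigma))
    n = len(d)
    while True:                               # step 3
        bad = next(((i, j) for i in range(n) for j in range(n)
                    if d[i][j] > d[j][j]), None)
        if bad is None:
            return d                          # step 4
        i, j = bad
        alpha = min_stable_at_least(j, d[i][j])
        if alpha is None:
            return None                       # step 3(a); recheck later
        d[j] = list(dep_vector(j, alpha))     # step 3(b)


# Invented execution: each process receives one message from the other.
VECTORS: Dict[int, Dict[int, List[int]]] = {
    0: {0: [0, BOTTOM], 1: [1, 1]},
    1: {0: [BOTTOM, 0], 1: [0, 1]},
}
stable: Dict[int, Set[int]] = {0: {0, 1}, 1: {0}}


def dep_vector(j: int, alpha: int) -> List[int]:
    return VECTORS[j][alpha]


def min_stable_at_least(j: int, lower: int) -> Optional[int]:
    candidates = [s for s in stable[j] if s >= lower]
    return min(candidates) if candidates else None


r0 = [[0, BOTTOM], [BOTTOM, 0]]
first = advance_recovery_state(r0, 0, 1, dep_vector, min_stable_at_least)
stable[1].add(1)  # interval 1 of process 1 becomes stable
second = advance_recovery_state(r0, 1, 1, dep_vector, min_stable_at_least)
third = advance_recovery_state(second, 0, 1, dep_vector, min_stable_at_least)
print(first, second, third)
```

In this run the first invocation fails because interval 1 of process 0 depends on interval 1 of process 1, which is not yet stable. Once that interval becomes stable, the recovery state advances for process 1, and rechecking the deferred interval of process 0 then advances it again.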

3.3  Some  Details 

Lemma 4 The state interval $\alpha$ chosen for process j during each iteration in step 3
must be the minimum $\alpha \ge \delta_{ij}$ that is stable.

Proof As a process executes, no element of its dependency vector can decrease. Thus, the
dependencies of any state interval of process j after this minimum $\alpha$ will be at least as large as those of
state interval $\alpha$. Clearly, if state interval $\delta_{ij}$ is stable, its dependencies will be exactly only those
that are necessary; any later state interval of process j may have additional dependencies that state
interval $\delta_{ij}$ does not have. Using the minimum set of dependencies possible with the stable process
states that are available will restrict the solutions the least. □


Lemma  5  The  comparisons  in  step  3  to  check  if  D  is  a  consistent  system  state  may 
be  made  in  any  order  without  affecting  the  final  resulting  dependency  matrix. 

Proof  Since  the  only  change  made  to  D  during  the  loop  of  step  3  is  the  replacement  of  row  j 
with  the  dependency  vector  for  state  interval  a,  the  only  effect  that  the  order  of  these  comparisons 
has  is  the  order  in  which  these  row  replacements  are  performed. 

First,  each  replacement  of  row  j  can  only  increase  6jj,  since  row  j  is  only  replaced  when  6i:  > 
Sjj,  and  the  new  dependency  vector  for  that  row  is  always  chosen  such  that  its  state  interval 

a  >  6tj  >  Sj  j. 

Second,  any  row  replacements  required  by  the  replacement  of  some  row  j  will  still  be  required 
after  the  replacement  of  row  j ',  j'  ^  j,  unless  row  j  also  required  the  replacement  of  row  j',  and 
the  state  interval  index  of  the  new  row  j'  is  greater  than  that  required  by  row  j,  in  which  case  this 
new  row  j'  would  still  be  required  if  row  j’s  requirement  had  been  met  first. 

Thus,  the  dependency  vector  left  in  each  row  of  D  when  the  algorithm  terminates  will  always 
have  the  maximum  state  interval  index  of  any  vector  placed  in  that  row  during  the  loop  of  step  3, 
regardless of the order in which the row replacements are made. □

Lemma 6 When state interval $\sigma$ of process k becomes stable, the basic algorithm
finds some recoverable system state $D = [\delta_{**}]$ with $\delta_{kk} = \sigma$, if any such system state
exists.

Proof Any system state found must be recoverable since only stable process state intervals are
included by the algorithm, and the resulting system state is checked for consistency. As each row
of D is replaced, the dependencies that must be satisfied grow as little as possible with the stable
process states that are available, as shown in Lemma 4. Since the state interval for any process used
in D never decreases as the algorithm executes, the state interval for process k in any recoverable
state found will never be less than $\sigma$. If the algorithm finds that it needs any state interval of
process k greater than $\sigma$, no recoverable state is possible, since the fact that state interval $\sigma$ of
process k is now stable has no effect on such a system state, and any such recoverable state that
exists would have already been found by some earlier execution of the algorithm. □

Lemma 7 No stable process state interval that was deferred in step 3a needs to be
rechecked until step 4 advances the current recovery state.

Proof Suppose some state interval $\sigma$ of process k becomes stable and the algorithm determines
that no new recoverable state is possible. By Lemma 6, this means that no consistent set of stable
process state intervals $A = [\alpha_{**}]$ is available with $\alpha_{kk} = \sigma$.

Now suppose some new state interval $\sigma'$ of process k' becomes stable. If the algorithm
determines that no new recoverable state is possible, there is no consistent set of stable process state
intervals $B = [\beta_{**}]$ available with $\beta_{k'k'} = \sigma'$. The only effect that this new state interval $\sigma'$ of
process k' can have on the earlier evaluation of state interval $\sigma$ of process k is that some recoverable
state may now be possible with the state interval index of process k' set to $\sigma'$, but the algorithm
has already determined that no such recoverable state is possible. Thus, there is no need to recheck
any earlier deferred stable process states in this case. □

This  lemma  shows  when  it  is  necessary  to  recheck  any  deferred  stable  state  intervals.  It  also 
gives  a  method  to  greatly  limit  the  set  of  those  deferred  stable  state  intervals  that  need  to  be 
rechecked,  rather  than  rechecking  all  such  state  intervals  that  are  not  yet  included  in  the  current 
recovery  state. 

Corollary 5 When the current recovery state advances from $R = [\rho_{**}]$ to some new
state $R' = [\rho'_{**}]$, the stable process states that were deferred earlier by step 3a and
should now be rechecked are those with a direct dependency on some state interval $\sigma$
of any process i such that $\rho_{ii} < \sigma \le \rho'_{ii}$.

Proof The proof follows directly from the proof of Lemma 7. □


3.4  Correctness 

Theorem 3 The recovery state algorithm always finds the current recovery state of
the system.

Proof First, by Lemma 6, the algorithm only finds recoverable system states. Also, any such
system states found will be greater than the previous current recovery state since at least the new
state interval $\sigma$ for process k is always greater than the previous state interval index for process
k in the current recovery state. Lemma 6 also shows that if some new recoverable state can be
formed when state interval $\sigma$ of process k becomes stable, the algorithm finds one. Lemma 7
shows when it is necessary to recheck any process state interval that could not be added to a
new current recoverable state when it became stable, and Corollary 5 shows which state intervals
should be rechecked then. By rechecking all those state intervals at the correct times, the maximum
recoverable state must be found. □


3.5  An  Efficient  Procedure 

The  algorithm  described  in  Sections  3.2  and  3.3  can  be  implemented  efficiently  by  making  some 
observations  about  the  execution  of  the  algorithm,  based  on  Lemma  5. 

When step 3 examines the dependency matrix D on each iteration, there may be many pairs
of i and j for which $\delta_{ij} > \delta_{jj}$, indicating that several different rows of the matrix need to be
replaced. These required row replacements can be entered into a list of pending replacements as
each is discovered. Since initially only row k of D has been changed from the current recovery
state, and since on each iteration, only one row is replaced at a time, only the single changed row
needs to be compared against the diagonal elements of D for consistency. Thus, only the diagonal
elements of the matrix are needed during the execution of the algorithm. The list of pending row
replacements only needs to remember the maximum index of any state interval needed for each
process, since the dependency vector that the algorithm leaves in each row of D is the one for that
process with the maximum state interval index, regardless of the order in which the replacements are
performed.
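
The consistency test and the single-row shortcut described above can be sketched as follows. This is only an illustration: the matrix is assumed to be a list of lists of integers, "no dependency" is assumed to be encoded as -1, and both function names are hypothetical.

```python
def inconsistent_pairs(D):
    """Return every (i, j) with D[i][j] > D[j][j].

    Each such pair means row i depends on a state interval of process j
    that is later than the one currently recorded for j on the diagonal.
    An entry of -1 (assumed encoding of "no dependency") never qualifies.
    """
    n = len(D)
    return [(i, j) for i in range(n) for j in range(n) if D[i][j] > D[j][j]]


def row_violations(row, diagonal):
    """Check one changed row against the diagonal only.

    Returns the processes j whose rows would have to be replaced because
    the changed row needs a later state interval than diagonal[j].
    """
    return [j for j in range(len(row)) if row[j] > diagonal[j]]
```

The second function reflects the observation that, after a single row replacement, the full matrix never has to be rescanned.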

The function FIND-RV, shown in Figure 3, is a procedure to implement steps 1 through 3
of the basic algorithm, taking advantage of these observations. This procedure finds a new
recoverable state based on state interval $\sigma$ of process k, if such a state exists. The list of pending row
replacements is maintained in NEED, such that NEED[i] is always the maximum index of any state
interval in process i that is currently needed to replace row i of the matrix. If no row replacements
are currently needed for some process j, then NEED[j] is set to $\bot$. A vector, RV, is used instead
of the full dependency matrix, where RV[i] is diagonal element i of the corresponding dependency
matrix, which is also the state interval index of process i in the recoverable state. As each row is
replaced, only the corresponding single element of RV is changed.
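
A minimal Python sketch of FIND-RV is given here under several assumptions that are not part of the original presentation: processes are numbered from 0, the value $\bot$ is encoded as -1, the dependency vectors are supplied in a dictionary `dv` keyed by (process, state interval index), and `stable[i]` lists the stable state interval indices of process i. These names and data structures are illustrative only.

```python
BOTTOM = -1  # assumed encoding of the "no pending replacement" value


def find_rv(rv, k, sigma, dv, stable):
    """Try to extend rv to a recoverable state containing interval sigma of k.

    rv:     list of diagonal elements, modified in place
    dv:     dict mapping (process, interval) to its dependency vector
    stable: stable[i] is a collection of stable interval indices of process i
    Returns True on success, False if some needed interval is not stable.
    """
    n = len(rv)
    if sigma <= rv[k]:
        return True  # this interval is already covered by rv
    need = [BOTTOM] * n
    rv[k] = sigma
    for i in range(n):
        if dv[(k, sigma)][i] > rv[i]:
            need[i] = dv[(k, sigma)][i]
    while any(x != BOTTOM for x in need):
        i = next(p for p in range(n) if need[p] != BOTTOM)
        candidates = [a for a in stable[i] if a >= need[i]]
        if not candidates:
            return False
        alpha = min(candidates)  # minimum stable interval, as in Lemma 4
        rv[i] = alpha
        need[i] = BOTTOM
        row = dv[(i, alpha)]
        for j in range(n):
            if row[j] > rv[j]:
                need[j] = max(need[j], row[j])
    return True
```

Because `rv` is modified in place even when the function returns False, a caller that wants to keep its previous state should pass a copy, which is the role NEWCRS plays in Figure 4.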

Using function FIND-RV, the full recovery state algorithm can now be stated. This algorithm,
shown in Figure 4, initially calls FIND-RV on the state interval that just became stable. If no
new recoverable state is found, the algorithm exits since no change in the current recovery state
is possible from this new stable state. If FIND-RV returns success, the result becomes the new
current recovery state, and the algorithm checks if any other recoverable states greater than this
result can now exist. The sets $DEFER_j^\beta$ keep track of those deferred stable process state intervals


function FIND-RV(RV, k, σ)
    if σ ≤ RV[k] then return true;
    for i ← 1 to n do NEED[i] ← ⊥;
    RV[k] ← σ;
    for i ← 1 to n do
        if DV_k^σ[i] > RV[i] then NEED[i] ← DV_k^σ[i];
    while ∃ i such that NEED[i] ≠ ⊥ do
        α ← minimum such that
            α ≥ NEED[i] and stable(i, α);
        if no such α then return false;
        RV[i] ← α;
        NEED[i] ← ⊥;
        for j ← 1 to n do
            if DV_i^α[j] > RV[j] then
                NEED[j] ← max(NEED[j], DV_i^α[j]);
    return true;


Figure 3 Finding a new recoverable state


WORK ← {(k, σ)};
while WORK ≠ ∅ do
    remove some state (x, θ) from WORK;
    if θ > CRS[x] then
        for j ← 1 to n do
            NEWCRS[j] ← CRS[j];
        if FIND-RV(NEWCRS, x, θ) = true
        then for j ← 1 to n do
                 for β ← CRS[j] + 1 to NEWCRS[j] do
                     WORK ← WORK ∪ DEFER_j^β;
                     DEFER_j^β ← ∅;
                 CRS[j] ← NEWCRS[j];
        else for j ← 1 to n do
                 β ← DV_x^θ[j];
                 if β > CRS[j] then
                     DEFER_j^β ← DEFER_j^β ∪ {(x, θ)};


Figure  4  The  recovery  state  algorithm 

that should be rechecked when the current recovery state advances over state interval $\beta$ of process
j. The set WORK keeps a list of those deferred states that are to be rechecked by the algorithm
because the current recovery state has been advanced.
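
The control flow of Figure 4 can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not a definitive implementation: processes are numbered from 0, the DEFER sets are held in a dictionary keyed by (process, state interval index), and the FIND-RV step is passed in as a callable `find_rv(rv, x, theta)` that modifies `rv` in place and returns a boolean. The name `recovery_state` and all parameter names are hypothetical.

```python
from collections import defaultdict


def recovery_state(crs, k, sigma, dv, defer, find_rv):
    """Advance the current recovery state after interval sigma of k is stable.

    crs:     list, current recovery state, modified in place
    dv:      dict mapping (process, interval) to its dependency vector
    defer:   defaultdict(set) mapping (j, beta) to deferred (x, theta) pairs
    find_rv: callable(rv, x, theta) -> bool, assumed to behave like FIND-RV
    """
    n = len(crs)
    work = {(k, sigma)}
    while work:
        x, theta = work.pop()
        if theta <= crs[x]:
            continue  # already included in the current recovery state
        new_crs = list(crs)
        if find_rv(new_crs, x, theta):
            for j in range(n):
                # Recheck everything deferred on an interval just passed over.
                for beta in range(crs[j] + 1, new_crs[j] + 1):
                    work |= defer.pop((j, beta), set())
                crs[j] = new_crs[j]
        else:
            for j in range(n):
                beta = dv[(x, theta)][j]
                if beta > crs[j]:
                    defer[(j, beta)].add((x, theta))
    return crs
```

Passing the search step in as a parameter keeps the sketch independent of how stability information is stored.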

4  Related  Work 

A number of fault-tolerance recovery methods based on message logging and checkpointing have
been published in the literature. These include methods using pessimistic logging protocols such as
Auros  [Borg83],  Publishing  [Powell83],  and  sender-based  message  logging  [Johnson87],  as  well  as 
optimistic  methods  [Strom85].  The  model  and  recovery  state  algorithm  presented  in  Sections  2 
and  3  can  be  applied  to  each  of  these  and  used  to  reason  about  their  correctness. 

Our  model  is  more  general  than  is  required  by  recovery  methods  based  on  pessimistic  message 
logging,  but  the  definitions  of  consistency,  stability,  and  recoverability  still  apply,  and  the  recovery 
state  algorithm  still  computes  the  correct  current  recovery  state.  In  this  case,  the  current  recovery 
state  is  identical  to  the  state  of  the  system  at  the  time  the  failure  occurred,  since  orphan  processes 
are not possible. Since message logging is synchronous, however, a simpler recovery state
algorithm is possible that takes advantage of the order in which information arrives on stable storage. In
particular,  checkpoints  never  add  new  information  for  the  algorithm,  since  messages  are  always 
logged  in  ascending  order  by  the  index  of  the  state  interval  that  they  start  in  their  receivers,  and 
all  messages  received  before  a  checkpoint  have  already  been  logged  before  the  checkpoint  can  be 
recorded. 

Recovery based on optimistic logging protocols requires the full generality of our model,
however. Since orphan processes are possible when using optimistic logging, recovery from a failure is
more difficult. Any orphan processes must be rolled back during recovery to achieve a consistent
state.  Since  there  is  no  synchronization  between  message  logging,  checkpointing,  and  computation, 
information  for  the  recovery  state  algorithm  may  arrive  on  stable  storage  at  any  time  and  in  any 
order.  Thus,  the  algorithm  must  be  able  to  make  use  of  all  this  information  in  order  to  advance 
the  current  recovery  state  to  its  maximum  possible  value  at  all  times. 

Our model and algorithm differ in several ways from those used for optimistic recovery by
Strom and Yemini [Strom85]. First, Strom and Yemini require reliable delivery of messages between
processes.  As  a  result,  their  definition  of  consistency  differs  from  ours  by  requiring  all  messages 
sent  to  have  been  received.  Our  model  does  not  require  reliable  delivery,  but  it  can  be  incorporated 
easily  by  inserting  a  return  acknowledgement  message  immediately  following  each  message  receipt, 
with  our  definition  of  consistency  remaining  unchanged.  Second,  although  their  system  checkpoints 
processes  in  order  to  shorten  recovery  times  and  release  old  logged  messages  from  stable  storage, 
they  do  not  take  advantage  of  these  checkpoints  in  computing  the  current  maximum  recoverable 
state  in  their  system.  Our  algorithm  uses  both  checkpoints  and  logged  messages  to  compute  the 
maximum  recoverable  state  and  thus  may  find  recoverable  states  that  their  algorithm  does  not. 
Finally,  our  algorithm  requires  only  the  current  state  interval  index  of  the  sending  process  to  be 
carried  in  each  message,  and  requires  only  a  vector  of  direct  dependencies  to  be  maintained  by 
each  process.  In  contrast,  their  method  requires  each  process  to  maintain  a  vector  of  its  transitive 
dependencies,  and  requires  each  message  to  be  tagged  with  this  vector,  which  has  size  linear  in  the 
number  of  processes.  This  added  complexity  does  allow  control  of  recovery  in  their  system  to  be 
more  decentralized  than  in  ours. 
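
The bookkeeping implied by the last point can be sketched as follows. The sketch assumes that each received message starts a new state interval in the receiver, that state interval indices start at 0, and that "no dependency" is encoded as -1; the class `Process` and its methods are hypothetical names used only for illustration.

```python
BOTTOM = -1  # assumed encoding of "no dependency on this process"


class Process:
    """Tracks only direct dependencies, one entry per process."""

    def __init__(self, pid, n):
        self.pid = pid
        self.dv = [BOTTOM] * n
        self.dv[pid] = 0  # own current state interval index

    def send(self, payload):
        # The message carries just the sender's current state interval index.
        return (self.pid, self.dv[self.pid], payload)

    def receive(self, message):
        sender, sender_interval, _payload = message
        self.dv[self.pid] += 1  # a received message starts a new state interval
        # Entries never decrease, so keep the larger of old and new values.
        self.dv[sender] = max(self.dv[sender], sender_interval)
```

Each message here carries a single integer of dependency information, whereas tagging messages with a full transitive dependency vector would add one entry per process.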


5  Conclusion 

From  a  performance  standpoint,  optimistic  message  logging  protocols  appear  to  be  desirable.  They 
seem  to  constitute  the  right  performance  tradeoff  in  operating  environments  where  failures  are 
rare  and  failure-free  performance  is  of  primary  concern.  The  recovery  state  algorithm  of  Section  3 
represents  an  improvement  on  earlier  work  with  recovery  based  on  optimistic  message  logging  by 
Strom  and  Yemini  [Strom85].  Although  their  algorithm  eventually  achieves  a  recoverable  state, 
this  state  may  not  be  optimal.  Furthermore,  their  methods  require  reliable  communication  and 
seem  more  complex  than  the  method  presented  here. 

This work unifies existing approaches to fault tolerance based on message logging and
checkpointing published in the literature, including those based on pessimistic message logging
methods [Borg83, Powell83, Johnson87] and those using optimistic methods [Strom85]. By using this
model to reason about these types of fault-tolerance recovery methods, properties that are
independent of the message logging protocol used can be deduced and proven. We have shown that
there  is  always  a  unique  maximum  recoverable  system  state,  which  never  decreases,  and  that  in 
a  system  where  all  messages  received  are  eventually  logged,  the  domino  effect  cannot  occur.  The 
use  of  this  general  model  allows  more  attention  to  be  paid  instead  to  designing  efficient  message 
logging  protocols. 